Notes: Machine Learning by Andrew Ng (Week3)
Classification and Representation
Classification
Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
To attempt classification, one method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. However, this method doesn’t work well because classification is not actually a linear function.
Hypothesis Representation
Sigmoid Function (Logistic Function):
$$g(z) = \frac 1{1 + e^{-z}}$$
$h_{\theta} (x) = g(\theta^T x)$ will give us the probability that our output is 1.
- $h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta)$
- $P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1$
Decision Boundary
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
$h_\theta (x) \ge 0.5 \rightarrow y = 1$
$h_\theta (x) < 0.5 \rightarrow y = 0$
Decision boundary: the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
E.g. $\theta = \begin{bmatrix} 5 \newline -1 \newline 0 \end{bmatrix}$
$y = 1$ if $5 + (-1)x_1 + 0x_2 \ge 0$
$\Rightarrow x_1 \le 5$
In this case, the decision boundary is a straight vertical line placed on the graph where $x_1 = 5$, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.
The input to the sigmoid function $g(z)$ (e.g. $\theta^T X$) doesn’t need to be linear, and could be a function that describes a circle (e.g. $z = \theta_0 + \theta_1 x_1^2 +\theta_2 x_2^2$) or any shape to fit our data.
Logistic Regression Model
Cost Function
We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
The cost function for logistic regression:
$$J(\theta) = \frac 1m \sum_{i=1}^m \mathrm{Cost} ( h_{\theta} (x), y^{(i)})$$
$\begin{cases} \mathrm{Cost} ( h_{\theta} (x), y^{(i)}) = - \log (h_{\theta} (x)) \qquad & \text{if} \ y = 1 \newline \mathrm{Cost} ( h_{\theta} (x), y^{(i)}) = - \log (1 - h_{\theta} (x)) \qquad & \text{if} \ y = 0 \end{cases}$
We get the following plot for $J(\theta)$ vs $h_{\theta} (x)$:
![image][pic2]
$\mathrm{Cost}(h_\theta (x),y) = 0 \ $ if and only if $\ h_\theta(x) = y$.
Note that writing the cost function in this way guarantees that $J(\theta)$ is convex for logistic regression.
Simplified Cost Function and Gradient Descent
Compress the cost function’s two conditional cases into one case:
$\mathrm{Cost} (h_{\theta} (x), y) = -y \log (h_{\theta} (x)) - (1 - y) \log (1 - h_{\theta} (x))$
The entire cost function:
$J(\theta) = - \frac 1m \sum_{i=1}^m \left[ y^{(i)} \log (h_{\theta} (x^{(i)}) ) + (1 - y^{(i)}) \log (1 - h_{\theta} (x^{(i)})) \right]$
Gradient descent:
$\begin{align*} & Repeat \ \lbrace \newline & \quad \theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \newline & \rbrace \end{align*}$
Notice that this algorithm is identical to the one used in linear regression.
Advanced Optimization
Notes: Machine Learning by Andrew Ng (Week2)
Multivariate Linear
Multiple Features
Multivariate linear regression: linear regression with multiple variables.
$\begin{align*} x_j^{(i)} &= \text{value of feature} \ j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the input (features) of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \text{the number of features} \end{align*}$
The multivariable form of the hypothesis function accommodating these multiple features is as follows:
$\begin{align*}h_{\theta}(x) &= \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x3 + \cdots + \theta_n x_n \newline &= \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \newline x_1 \newline \vdots \newline x_n \newline \end{bmatrix} = \theta^T x \end{align*}$
Gradient Descent For Multiple Variables
The gradient descent equation:
$\text{repeat until convergence:} \ \lbrace$
$\quad \theta_0 = \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}$
$\quad \theta_1 = \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)}$
$\quad \theta_2 = \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)}$
$\quad \cdots$
$\rbrace$
In other words:
$\text{repeat until convergence:} \ \lbrace$
$\quad \text{simultaneously update} \ \theta_j \ \text{for } j = 0,…, n \ \lbrace$
$\qquad \theta_j = \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$
$\quad \rbrace$
$\rbrace$
Feature Scaling
We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Modify the ranges of the input variables so that they are all roughly the same. Ideally: $-1 \le x_{(i)} \le 1 \ $ or $\ -0.5 \le x_{(i)} \le 0.5 $ (they aren’t exact requirements)
Feature scaling involves dividing the input values by the range ($max$ - $min$) of the input variable (or the standard deviation of the input variable) resulting in a new range of just 1.
Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
To implement both of these techniques, adjust your input values as shown in this formula:
$$x_i = \frac {x_i - \mu_i}{s_i}$$
$\mu_i$: the average
$s_i$: the range / the standard deviation
Learning Rate
Debugging gradient descent:
Make a plot with $\text{number of iterations}$ on the x-axis. Now plot the cost function $J(\theta)$ over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease $\alpha$.
Automatic convergence test:
Declare convergence if $J(\theta)$ decreases by less than $E$ in one iteration, where $E$ is some small value such as $10^{-3}$. However in practice it’s difficult to choose this threshold value.
If learning rate $\alpha$ is sufficiently small, then $J(\theta)$ will decrease on every iteration.
Make sure gradient descent is working correctly!
To choose $\alpha$, try:
$…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …$
To summarize:
- $\alpha$ is too small: very slow convergence
- $\alpha$ is too large: may not decrease on every iteration and may even diverge
Features and Polynomial Regression
There are several ways to improve the features and the form of the hypothesis function.
Such as, combine multiple features into one. For example, combine $x_1$ and $x_2$ into a new feature $x_3 = x_1 \cdot x_2$
We can change the hypothesis function by making it a quadratic, cubic or square root function (or any other form). Notably, if the features are chosen in this way, then feature scaling becomes very important.
E.g. if $x_1 \in (1, 1000)$, then $x_1^2 \in (1, 1000000)$ and $x_1^3 \in (1, 1000000000)$
Computing Parameters Analytically
Normal Equation
A second way of minimizing:
Minimize $J$ by explicitly taking its derivatives with respect to the $\theta_j$'s, and setting them to zero. This allows us to find the optimum $\theta$ without iteration.
The normal equation formula: $\ \theta = (X^T X)^{-1} X^T y$
There is no need to do feature scaling with the normal equation.
A comparison of gradient descent and the normal equation:
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate inverse of $X^T X$ |
| Works well when n is large | Slow if n is very large |
In practice, when n exceeds 10000 it might be a good time to go from the “Normal Equation” method to gradient descent.
Normal Equation Noninvertibility
When implementing the normal equation in octave we want to use the ‘$pinv$’ function rather than ‘$inv$’. The ‘$pinv$’ function will give you a value of $\theta$ even if $X^T X$ is not invertible.
If $X^T X$ is noninvertible, the common causes might be having:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization”
Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.
Notes: Machine Learning by Andrew Ng (Week11)
Photo OCR
Problem Description and Pipeline
Photo Optical Character Recognition (OCR): Converse images of typed, handwritten or printed text into machine-encoded text.
Photo OCR pipeline:
1 | st=>start: image |
Sliding Windows
In the step of text detection, we try to find the regions of text appearing in the image. To begin with, take a simpler version of detection, pedestrian detection, as an example. This problem is simpler because it uses a fixed aspect ratio for those rectangles.
Supervised learning for pedestrian detection:
$x$ = pixels in $82 \times 36$ image patches, and collect large training sets of positive and negative examples of images.
Sliding window detection:
Take a rectangular patch of an image, and run that image patch through the classifier to determine whether or not there is a pedestrian in the image patch. Then, slide the window further to the right and run the new patch through the classifier again.
The amount of sliding the rectangle over each time is a parameter called $step \ size$ or $stride$. If we slide the window 1 pixel at a time, $step \ size = 1$. It is more common to use a step size of 4 or 8 pixels.
Start with a small rectangle which could only detect pedestrians of one specific size, then switch to some larger image patches. After taking a larger image patch, we should resize the image down to the same size, $82 \times 36$, and then pass it through the classifier.
Return to text detection.
To illustrate, run sliding windows at one fixed scale for this example. Use “white” color to show that the classifier think text appear in that region. In the lower left image, the different shades of gray corresponds to different probabilities that there exists text in that location.
We want to draw rectangles around all the region where the text in the image, so we take one more step to take the output of the classifier and apply it to an expansion operator. Mathematically, it color the gap white between two white regions, if the distance between a white region and its leftmost white region is smaller than a narrow size.
Having found these rectangles with the text in them, then cut out these image regions and use later stages of pipeline to recognize the texts.
1D Sliding window for character segmentation:
Decide whether or not there is a split between two characters.
Getting Lots of Data and Artificial Data
For more training examples:
- Take characters from different fonts and paste these characters against different random backgrounds.
- Synthesize data by introducing distortions.
Distortion introduced should be representation of the type of noise / distortions in the test set.
Usually does not help to add purely random / meaningless noise to the data.
Discussion on getting more data:
- Make sure the classifier is low-bias before expending the effort (plot learning curves). E.g. keep increasing the number of features / number of hidden units in neural network until the classifier is low-bias.
- “How much work would it be to get 10x as much data as we currently have?”
Ceiling Analysis: What Part of the Pipeline to Work on Next
Estimate the errors due to each component (ceiling analysis):
What part of the pipeline should be spent the most time trying to improve?
| Component | Accuracy |
|---|---|
| Overall system | 72% |
| Text detection | 89% |
| Character recognition | 90% |
| Character recognition | 100% |
The overall system currently has 72% accuracy.
Go through the first module of the machine learning pipeline, the text detection. Manually label the location of the text in the test set. In other word, to simulate a text detection system with one hundred percent accuracy.
Then, use these ground truth labels to feed them into the next step of the pipeline the character segmentation. Manually label the correct segmentation of the text into individual characters, and see how much that helps.
We can understand what is the upside potential of improving each of the components. If we get perfect text detection, the performance went up from 72% to 89%, which is a 17% performance gain. Hence, if we spend a great amount of time improving text detection, for the current system, we could potentially improve the system’s performance by 17%. However, no matter how much time is spent on character segmentation, the performance went up only by 1%.
Review:
Suppose you perform ceiling analysis on a pipelined machine learning system, and when we plug in the ground-truth labels for one of the components, the performance of the overall system improves very little. This probably means:
- It is probably not worth dedicating engineering resources to improving that component of the system.
- If that component is a classifier training using gradient descent, it is probably not worth running gradient descent for 10x as long to see if it converges to better classifier parameters.
Notes: Machine Learning by Andrew Ng (Week10)
Gradient Descent with Large Datasets
Learning With Large Datasets
“It’s not who has the best algorithm that wins. It’s who has the most data.”
It is reasonable to randomly pick a subset with 1000 examples before training the model with massive data.
If the training objective looks like a high-variance learning algorithm (the left plot above), it is feasible to add extra training examples to improve performance. However, if the training objective looks like a high-bias learning algorithm (the right plot above), it seems unlikely to improve performance by increasing the number of training examples, $m$, from 1000 to 1000000000.
Large-scale machine learning is committed to utilizing computationally efficient way to deal with very large datasets.
Stochastic Gradient Descent
Gradient descent will be computationally expensive if the training set is very large. Therefore, Stochastic gradient descent is able to scale algorithms to much bigger training sets.
Stochastic gradient descent algorithm:
$cost \left( \theta , (x^{(i)}, y^{(i)}) \right) = \frac 12 \left( h_{\theta} (x^{(i)}) - y^{(i)} \right) ^2$
$J_{train} (\theta) = \frac 1m \sum_{i=1}^m cost \left( \theta , (x^{(i)}, y^{(i)}) \right)$
- Randomly shuffle (reorder) training examples
- Repeat { // 1-10 times
$\quad for \ i = 1: m \ $ {
$\quad \quad \theta_j = \theta_j - \alpha \left( h_{\theta} (x^{(i)} - y^{(i)}) \right) x_j^{(i)} \ $ (for every $j = 0, …, n$)
}
}
What Stochastic gradient descent do is scanning through the training examples:
First, the algorithm looks at only the 1st training example ($x^{(1)}, y^{(1)}$), and takes a little gradient descent step with respect to the cost of this 1st example. In other word, the algorithm looks at the 1st training example and modify the parameters a little bit to fit the 1st training example a little bit better. Second, go on to the 2nd training example and take another little step in parameter space. And then, the 3rd training example and so on.
Randomly shuffling is helpful to speed up the conversion of Stochastic gradient descent.
More importantly, Stochastic gradient descent take the gradient term using just one single training example and start to make progress in enhancing the parameters already, while Batch gradient descent waits to sum up all gradient terms over all $m$ training examples.
Stochastic gradient descent will generally move the parameters in the direction of the global minimum, but not always. Actually, Stochastic gradient descent doesn’t converge in the same sense as Batch gradient descent, and it ends up wandering around continuously in some region which is close to the global minimum.
By running Stochastic gradient descent, we get a set of parameters near the global minimum and that’s good enough for the most practical purposes.
Mini-Batch Gradient Descent
Batch gradient descent: Use all $m$ examples in each iteration.
Stochastic gradient descent: Use 1 example in each iteration.
Mini-batch gradient descent: Use $b$ examples in each iteration.
$b$ is the mini-batch size, and a typical value for $b$ is 10, [2, 100] is a typical range of values for $b$.
For instance, let $b = 10, m = 1000$.
Repeat {
$\quad for \ i = 1,11,21,31,…,991 \ $ {
// (for every $j = 0, …, n$)
$\theta_j = \theta_j - \alpha \frac 1{10} \sum_{k=i}^{i+9} \left( h_{\theta} (x^{(i)} - y^{(i)}) \right) x_j^{(i)} \ $
}
}
Mini-Batch Gradient Descent can run even faster than Stochastic gradient descent, only if there exists a good vectorized Implementation.
Stochastic Gradient Descent Convergence
Check for convergence:
During learning, compute $cost \left( \theta , (x^{(i)}, y^{(i)}) \right)$ before updating $\theta$ using ($x^{(i)}, y^{(i)}$).
Every 1000 iterations (say), plot $cost \left( \theta , (x^{(i)}, y^{(i)}) \right)$ averaged over the last 1000 examples processed by algorithm.
$(a)$ Stochastic gradient descent doesn’t just converge to the global minimum, the parameters will oscillate a little bit around the global minimum. Therefore, using a smaller learning rate $\alpha$ will end up with smaller oscillations.
$(b)$ Change the number of training examples from 1000 to 5000 to get a smoother curve.
$\left( c \right)$ It looks like the algorithm is not learning. Increase $b$, it is possible that the cost is decreasing shown by the red line. The blue line was so noisy you couldn’t see the actual trend in the cost is very hard to be spotted. Actually decreasing and possibly averaging over 5,000 examples instead of 1,000 may help. If the curve is still flat just like the magenta one, then it is a more firm verification that the algorithm does not learn much.
$(d)$ It is a sign that the algorithm is diverging. What we should do is to use a smaller value of the learning rate $\alpha$.
Learning rate $\alpha$ is typically held constant. It is feasible to slowly decrease $\alpha$ over time if we want $\theta$ to converge. (E.g. $\alpha = \frac {const1}{iterationNumber + const2}$)
Advanced Topics
Online Learning
Online learning is an algorithm to learn something from a continuous stream of data.
Online learning example 1:
Shipping service website where user comes, specifies origin and destination, and offers to ship their package for some asking price, and users sometimes choose to use this shipping service ($y = 1$), sometimes not ($y = 0$).
Feature $x$ capture properties of users, of origin / destination and asking price. We want to learn $p(y = 1|x; \theta)$ to optimize price.
If the owner of the shipping website service could estimate the chance that a user will agree to use the shipping service for any given price, then, he should try to pick a price so that the user has a pretty high probability of choosing this website while simultaneously offering himself a fair profit for shipping the package.
Repeat forever {
Get $(x, y)$ corresponding to a user.
Update $\theta$ using $(x, y)$:
$\theta_j = \theta_j - \alpha \left( h_{\theta}(x) - y \right) \cdot x_j \ (j = 0, …, n)$
}
Use $(x, y)$ instead of previous notion $(x^{(i)}, y^{(i)})$, because the online learning algorithm looks at a certain example only one time, learn from it, then discard it and never use it again.
In particular, this online learning algorithm can adapt to changing user preferences.
Online learning example 2:
Product search (learning to search)
Users search for “Android phone 1080p camera”. There are 100 phones in store, and
what an online learning algorithm should do here is that it is supposed to figure out picking 10 phones out of 100 and return a search response to users.
$x =$ features of phone, how many words in query match description of phone, etc.
$y = 1$ if user clicks on link. $y = 0$ otherwise.
Learn $p(y = 1 | x; \theta)$: the problem of learning the predicted click-through rate (CTR).
Use the probability above to show users the 10 phones they are most likely to click on.
To be clear, every time a user does a search, the online learning algorithm will return 10 pairs of $(x, y)$, which also gives the algorithm itself 10 training examples, because for the 10 phones chosen to show the user, a feature vector $X$ and a result value of $y$ will be gotten for each of those 10 phones. Repeat this process continuously, the online learning algorithm is also updating the parameters using 10 steps of gradient descent on the 10 examples.
Map Reduce and Data Parallelism
Assume we want to fit a linear regression model or a logistic regression model, and begin with batch gradient descent. Assume there are $m = 400$ training examples, for a simpler representation.
Batch gradient descent: $\theta_j = \theta_j - \alpha \frac 1{400} \sum_{i=1}^{400} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$
- Machine 1: Use $(x^{(1)}, y^{(1)}), …, (x^{(100)}, y^{(100)})$.
$\color {blue} {temp_j^{(1)} = \sum_{i=1}^{100} \left( h_{\theta}(x^{(i)}) - y^{(i)}\right) \cdot x_j^{(i)}}$ - Machine 2: Use $(x^{(101)}, y^{(101)}), …, (x^{(200)}, y^{(200)})$.
$\color {blue} {temp_j^{(2)} = \sum_{i=101}^{200} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}}$ - Machine 3: Use $(x^{(201)}, y^{(201)}), …, (x^{(300)}, y^{(300)})$.
$\color {blue} {temp_j^{(3)} = \sum_{i=201}^{300} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}}$ - Machine 4: Use $(x^{(301)}, y^{(301)}), …, (x^{(400)}, y^{(400)})$.
$\color {blue} {temp_j^{(4)} = \sum_{i=301}^{400} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}}$
After all these 4 machines computed their $temp_j$ variables, these variables will be sent to a centralized master server and be combined together. Then, the master server will update all parameters $\theta_j$ by
$\qquad \theta_j = \theta_j - \alpha \frac 1{400} (temp_j^{(1)} + temp_j^{(2)} + temp_j^{(3)} + temp_j^{(4)}), \ j = 1, …, n$
Map-reduce divides up the work load on 4 machines instead of summing up all 400 training examples on just one machine so that it can speed up the learning algorithm.
Comparing with parallel computing over multiple machines, parallel computing within a singer machine using a CPU with multiple cores does not need to worry about network latency. All the communication happens within a singer machine, so network latency becomes much less of an issue.
Map-reduce and summation over the training set:
Many learning algorithms can be expressed as computing sums of functions over the training set.
Notes: Machine Learning by Andrew Ng (Week9)
Density Estimation
Problem Motivation
Anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
Taking quality assurance test on aircraft engines as an example. In the test, the heat generated and the vibration, as the features of the aircraft engine, would be measured. If there is a new aircraft engine produced next day, we want to know whether the new engine should undergo further test.
- $p(x_{test}) < \epsilon \rightarrow “anomaly”$
- $p(x_{test}) \ge \epsilon \rightarrow “ok”$
Anomaly detection example:
-
Fraud detection:
$x^{(i)}$ = features of user $i$'s activities.
Model $p(x)$ from data.
Identify unusual users by checking which have $p(x) < \epsilon$. -
Manufacturing
-
Monitoring computers in a data center:
$x^{(i)}$ = features of machine $i$.
$x_1$ = memory use, $x_2$ = number of disk accesses/sec,
$x_3$ = CPU load, $x_4$ = CPU load/network traffic…
Gaussian Distribution (Normal Distribution)
$x \in \mathbb R$. If $x$ is a distributed Gaussian with mean $\mu$, variance $\sigma^2$: $\ X \sim \mathcal N(\mu, \sigma^2)$
$$p(x; \mu, \sigma^2) = \frac 1{\sqrt {2\pi} \sigma} e^{-\frac {(x - \mu)^2}{2 \sigma^2} }$$
$$\mu = \frac 1m \sum_{i=1}^m x^{(i)}, \quad \sigma^2 = \frac 1m \sum_{i=1}^m (x^{(i)} - \mu)^2$$
Anomaly Detection Algorithm
We are going to model $p(x)$ from training set {$x^{(1)}$, …, $x^{(m)}$}, each example $x \in \mathbb R^n$
$$p(x) = p(x_1; \mu_1, \sigma_1^2) \times p(x_2; \mu_2, \sigma_2^2) \times p(x_3; \mu_3, \sigma_3^2) \times \cdots \times p(x_n; \mu_n, \sigma_n^2)$$,
where $x1 \sim \mathcal N(\mu_1, \sigma_1^2)$, $\ x2 \sim \mathcal N(\mu_2, \sigma_2^2)$, $\ x3 \sim \mathcal N(\mu_3, \sigma_3^2)$ … (the problem of estimating the distribution $p(x)$ is sometimes called the problem of density estimation).
Hence, the algorithm is:
- Choose features $x_i$ that might be indicative of anomalous examples.
- Fit parameters $\mu_1, …, \mu_n, \sigma_1^2, …, \sigma_n^2$
$$\mu_j = \frac 1m \sum_{i=1}^m x_j^{(i)}, \quad \sigma_j^2 = \frac 1m \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2$$ - Given new example $x$, compute $p(x)$:
$$p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^n \frac 1{\sqrt {2 \pi} \sigma_j} exp(- \frac {(x_j - \mu_j)^2}{2 \sigma_j^2})$$ Anomaly if $p(x) < \epsilon$
Building an Anomaly Detection System
Developing and Evaluating an Anomaly Detection System
The importance of real-number evaluation: when Developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.
If there is an extra feature, should this new feature be included or not? Hence, we should run the algorithm with and without the feature, and get back a number indicating that adding the new feature improve or worsen performance, which is a much better way to decide whether that feature should be included or not.
Therefore, in order to be able to develop an anomaly detection system quickly, it would be a really helpful to have a way of evaluating an anomaly detection system.
Assume we have some labeled data, of anomalous and non-anomalous examples. (y = 0 if normal, y = 1 if anomalous). Then, the process of developing and evaluating an anomaly detection algorithm is as follows:
- Training set: $x^{(1)}, x^{(2)}, …, x^{(m)}$ (assume normal examples / not anomalous, and it will still work if a few anomalies slip into the unlabeled training set)
- Cross validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), …, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
- Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), …, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$
For instance, here is the aircraft engines motivating example:
- 10000 good engines (normal)
- 20 (2~50) flawed engines (anomalous):
Training set: 6000 good engines
CV: 2000 good engines (y = 0), 10 anomalous (y = 1)
Test: 2000 good engines (y = 0), 10 anomalous (y = 1)
Algorithm evaluation:
Fit model $p(x)$ on training set {$x^{(1)}, …, x^{(m)}$}
On a cross validation / test example $x$, predict
$$ y =
\begin{cases}
1 &\ \text{if} \ p(x) < \epsilon \ \text{(anomaly)} \cr
0 &\ \text{if} \ p(x) \ge \epsilon \ \text{(normal)}
\end{cases}
$$
Possible evaluation metrics:
- True positive, false positive, false negative, true negative
- Precision / Recall
- $F_1-score$
Can also use cross validation set to choose parameter $\epsilon$
Anomaly Detection vs. Supervised Learning
| Anomaly detection | Supervised learning |
|---|---|
| Very small number of positive (y = 1) examples (0~20 is common), and large number of negative (y = 0) examples which can fit $p(x)$ very well | Large numberof positive and negative examples |
| Many different “types” of anomalies, which is hard for any algorithm to learn what the anomalies look like from positive examples; future anomalies may look nothing like any of the anomalous examples we have seen so far | Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples are likely to be similar to ones in the training set |
A key difference is that, in anomaly detection. there is such a small number of positive examples which is not possible for a learning algorithm to learn, so instead the large set of negative examples can be used to learn $p(x)$.
Choosing What Features to Use
In the anomaly detection algorithm, one of the most important things is modeling the features using the sort of Gaussian distribution. Hence, it is recommended to plot the data or the histogram of the data to make sure that the data looks vaguely Gaussian, before feeding it to the anomaly detection algorithm.
After plotting the histogram, if it looks non-gaussian, then we should try with different transformations of the data in order to make it look more Gaussian.
Error analysis for anomaly detection:
Want $p(x)$ large for normal examples $x$ and $p(x)$ small for anomalous examples $x$.
Most common problem:
$p(x)$ is comparable (say, both large) for normal and anomalous examples.
In the picture above, there is an anomalous examples that have been drawn green. It gets a pretty high probability so that the algorithm fails to flag this as an anomalous example. What we should do is that, look at the training examples and find something unusual which could inspire us to come up with a new feature $x_2$ that helps to distinguish between the bad example.
Thus, after creating the new feature $x_2$, re-plot the data (the scatter in the right). For all the anomalous examples, the feature $x_2$ takes on the unusual value.
In short, the way to choose features is to choose features that will take on either very large values or very small values, which are likely to turn out to be anomalies.
Also, we can create new features by combining the existing features, such as $x_3 = \frac {x_1}{x_2}$, $x_4 = \frac {(x_1)^2}{x_2}$. By creating features like these, we can start to capture anomalies that correspond to unusual combinations of values of the features.
Multivariate Gaussian Distribution
Multivariate Gaussian Distribution
If there is a new example, using the green cross ($x_1 = 0.4, x_2 = 1.5$) to express, which has a very low CPU load and a high memory use. It looks like that should be an anomaly. Nonetheless, according to the anomaly detection algorithm, both of the features, $x_1$ and $x_2$, have a reasonably high probability. Thus, an anomaly detection algorithm will fail to flag this green cross as an anomaly.
Using the multivariate Gaussian distribution to fix this problem.
Multivariate Gaussian (Normal) Distribution
$x \in \mathbb R^n$. This time, don’t model $p(x_1), p(x_2), …,$ etc. separately.
Model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb R^n, \Sigma \in \mathbb R^{n \times n}$ (covariance matrix), $|\Sigma|$ = determinant of $\Sigma$ = $\color {darkblue}{det(\Sigma)}$
$$p(x; \mu, \Sigma) = \frac 1{(2 \pi)^{\frac n2} |\Sigma|^{\frac 12}} exp(-\frac 12 (x - \mu)^T \Sigma^{-1} (x - \mu))$$
Multivariate Gaussian examples:
- For a specific value of $x_1$ and for a specific value of $x_2$, the height of the surface is the value of $p(x)$.
- $\Sigma$ is a covariance matrix measuring the variance of the features. Shrinking $\Sigma$ would end up with a narrower distribution while incresing $\Sigma$ would end up with a wider and flatter Gaussian.
- The multivariate distribution can be used to model correlations between the data, by changing the off diagonal entries of $\Sigma$.
- Vary the mean parameter $\mu$ to change the center of the distribution.
Anomaly Detection Using the Multivariate Gaussian Distribution
Given training set {$x^{(1)}, x^{(2)}, …, x^{(m)}$} with each of $x^{(i)} \in \mathbb R_n$.
- Fit model $p(x)$ by setting
$\mu = \frac 1m \sum_{i = 1}^m x^{(i)}$
$\Sigma = \frac 1m \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T$ - Given a new example $x$, compute
$p(x; \mu, \Sigma) = \frac 1{(2 \pi)^{\frac n2} |\Sigma|^{\frac 12}} exp(-\frac 12 (x - \mu)^T \Sigma^{-1} (x - \mu))$
Flag an anomaly if $p(x) < \epsilon$
Relationship to the original model
| Original model | Multivariate Gaussian |
|---|---|
| $p(x_1; \mu_1, \sigma_1^2) \times \cdots \times p(x_n; \mu_n, \sigma_n^2)$ | $p(x; \mu, \Sigma) = \frac 1{(2 \pi)^{\frac n2} \lvert \Sigma \rvert^{\frac 12}} exp(-\frac 12 (x - \mu)^T \Sigma^{-1} (x - \mu))$ |
| Manually create features to capture anomalies where $x_1$, $x_2$ take unusual combinations of values, such as $x_3 = \frac {x_1}{x_2}$ | Automatically captures correlations between features |
| Computationally cheaper (alternatively, scales better to large $n$) | Computationally more expensive |
| OK even if $m$ (training set size) is small | Must have $m > n$ or else $\Sigma$ is non-invertible ($m \ge 10n$) |
Original model corresponds to multivariate Gaussian $p(x; \mu, \Sigma)$ where $\Sigma = \begin{bmatrix} \sigma_1^2 & & & \cr & \sigma_2^2 & & \cr & & \ddots & \cr & & & \sigma_n^2 \end{bmatrix}$
Predicting Movie Ratings
Problem Formulation
Example: Predicting movie ratings (user rates movies using 0 to 5 stars)
| Movie | Alice(1) | Bob(2) | Carol(3) | Dave(4) |
|---|---|---|---|---|
| Love at last | $\color{blue} 5$ | $\color{blue} 5$ | $\color{blue} 0$ | $\color{blue} 0$ |
| Romance forever | $\color{blue} 5$ | $\color{blue} ?$ | $\color{blue} ?$ | $\color{blue} 0$ |
| Cute puppies of love | $\color{blue} ?$ | $\color{blue} 4$ | $\color{blue} 0$ | $\color{blue} ?$ |
| Nonstop car chases | $\color{blue} 0$ | $\color{blue} 0$ | $\color{blue} 5$ | $\color{blue} 4$ |
| Swords vs. karate | $\color{blue} 0$ | $\color{blue} 0$ | $\color{blue} 5$ | $\color{blue} ?$ |
$n_u$ = number of users, $\ n_m$ = number of movies
$r(i, j)$ = 1 if user $j$ has rates movie $i$
$y^{(i, j)}$ = rating given by user $j$ to movie $i$ (defined only if $r(i, j)$ = 1)
The recommender system problem is that, given the data $r(i, j)$ and $y^{(i, j)}$, look through the data and try to predict the missing rating values of those non-rated examples (movies in this case), in order to recommend new movies which might be interesting to a user.
Content Based Recommendations
Content-based recommender systems
For each user $j$, learn a parameter $\theta^{(j)} \in \mathbb R^{n+1}$. Predict user $j$ as rating movie $i$ with $(\theta ^{(j)}) ^T x^{(i)}$ stars.
This is basically a linear regression problem.
Problem formulation
$r(i, j)$ = 1 if user $j$ has rates movie $i$ (0 otherwise)
$y^{(i, j)}$ = rating by user $j$ on movie $i$ (if defined)
$\theta ^{(j)}$ = parameter vector for user $j$
$x^{(i)}$ = feature vector for movie $i$
$m^{(j)}$ = number of movies rated by user $j$
For user $j$, movie $i$, predict rating: $(\theta ^{(j)}) ^T x^{(i)}$
Optimization objective
To learn $\theta ^{(j)}$ (parameter for user $j$):
$$\underset {\theta ^{(j)}}{min} \frac 12 \sum_{i:r(i, j) = 1} [(\theta ^{(j)}) ^T x^{(i)} -y^{(i, j)}] ^2 + \frac {\lambda}2 \sum_{k=1}^n (\theta_k^{(j)}) ^2$$
To learn $\theta ^{(1)}$, $\theta ^{(2)}$, …, $\theta ^{(n_u)}$:
$$\underset {\theta ^{(1)}, …, \theta ^{(n_u)}}{min} \frac 12 \sum_{j=1}^{n_u} \sum_{i:r(i, j) = 1} [(\theta ^{(j)}) ^T x^{(i)} -y^{(i, j)}] ^2 + \frac {\lambda}2 \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)}) ^2$$
Gradient descent update:
$$\theta_0^{(j)} = \theta_0^{(j)} - \alpha \sum_{i:r(i, j) = 1} \left ( (\theta ^{(j)}) ^T x^{(i)} - y ^{(i, j)} \right ) x_0^{(i)}$$
$$\theta_k^{(j)} = \theta_k^{(j)} - \alpha \left ( \sum_{i:r(i, j) = 1} \left ( (\theta ^{(j)}) ^T x^{(i)} - y ^{(i, j)} \right ) x_k^{(i)} + \lambda \theta_k^{(j)} \right ) \quad (\text {for} \ k \ne 0)$$
Collaborative Filtering
Collaborative Filtering
Feature learning: an algorithm learns what features to use for itself.
In reality, it is very difficult and time-consuming to let someone to watch every movie and tell how romantic or how action packed the movie is. Hence, we have no idea how romantic or action packed each movie is, and features in the table above are represented as “?”.
Then, each of our users would tell how much they like a movie by telling what the value of $\theta ^{(j)}$ is.
$\theta^{(1)} = \begin{bmatrix} 0 \cr 5 \cr 0 \end{bmatrix}, \ \theta^{(2)} = \begin{bmatrix} 0 \cr 5 \cr 0 \end{bmatrix}, \ \theta^{(3)} = \begin{bmatrix} 0 \cr 0 \cr 5 \end{bmatrix}, \ \theta^{(1)} = \begin{bmatrix} 0 \cr 0 \cr 5 \end{bmatrix}$
The values of $x_1$ and $x_2$ for each movie can be inferred if we get these parameters $\theta^{(j)}$.
Optimization algorithm
Given $\theta_{(1)}, …, \theta_{(n_u)}$, to learn $x^{(i)}$:
$$\underset {x^{i}}{min} \frac 12 \sum_{j:r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i , j)} \right)^2 + \frac {\lambda}2 \sum_{k=1}^n (x_k^{(i)}) ^2$$
Given $\theta_{(1)}, …, \theta_{(n_u)}$, to learn $x^{(1)}, …, x^{(n_m)}$:
$$\underset {x^{1}, …, x^{n_m}}{min} \frac 12 \sum_{i=1}^{n_m} \sum_{j:r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i , j)} \right)^2 + \frac {\lambda}2 \sum_{i=1}^{n_m} \sum_{k=1}^n (x_k^{(i)}) ^2$$
Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
Collaborative Filtering Algorithm
Collaborative filtering optimization objective: Minimizing $x_{(1)}, …, x_{(n_m)}$ and $\theta_{(1)}, …, \theta_{(n_m)}$ simultaneously:
$$J(x_{(1)}, …, x_{(n_m)}, \theta_{(1)}, …, \theta_{(n_u)}) = \frac 12 \sum_{(i, j):r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i , j)} \right)^2 + \frac {\lambda}2 \sum_{i=1}^{n_m} \sum_{k=1}^n (x_k^{(i)}) ^2 + \frac {\lambda}2 \sum_{i=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)}) ^2$$
$$\underset {x_{(1)}, …, x_{(n_m)} \\ \theta_{(1)}, …, \theta_{(n_u)}}{min} J(x_{(1)}, …, x_{(n_m)}, \theta_{(1)}, …, \theta_{(n_u)})$$
With the new algorithm, we are going to learn features $x \in \mathbb R^n$ and parameters $\theta \in \mathbb R^n$, because we are now learning all the features and th ere is no need to hard code the feature that is always equal to 1. If the algorithm really wants a feature equal to 1, it can choose to learn one for itself by setting $x_1 = 1$ instead of $x_0 = 1$.
Collaborative filtering algorithm
- Initialize $x_{(1)}, …, x_{(n_m)}, \theta_{(1)}, …, \theta_{(n_u)}$ to small random values.
- Minimize $J(x_{(1)}, …, x_{(n_m)}, \theta_{(1)}, …, \theta_{(n_u)})$ using gradient descent (or an advanced optimization algorithm). E.g. for every $j = 1,…, n_u, i = 1, …, n_m$:
$x_k^{(i)} = x_k^{(i)} - \alpha \left( \sum_{j:r(i, j) = 1} \left((\theta^{(i)}) ^T x^{(i)} - y^{(i, j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)$
$\theta_k^{(j)} = \theta_k^{(j)} - \alpha \left( \sum_{i:r(i, j) = 1} \left((\theta^{(i)}) ^T x^{(i)} - y^{(i, j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)$ - For a user with parameter $\theta$ and a movie with (learned) features $x$, predict a star rating of $\theta^T x$.
Low Rank Matrix Factorization
Vectorization: Low Rank Matrix Factorization
Group all the ratings into a matrix:
$Y = \begin{bmatrix} 5 & 5 & 0 & 0 \cr 5 & ? & ? & 0 \cr ? & 4 & 0 & ? \cr 0 & 0 & 5 & 4 \cr 0 & 0 & 5 & ? \end{bmatrix}$
Given this matrix $Y$ of all the ratings, there is an alternative way of writing out all the predictive ratings of the algorithm. What user $j$ predicts on movie $i$ is given by this formula: $(\theta^{(j)}) ^T x^{(i)}$.
Let $X = \begin{bmatrix} —(x^{(1)}) ^T— \cr —(x^{(2)}) ^T— \cr \vdots \cr —(x^{(n_m)}) ^T— \end{bmatrix}$, $\Theta = \begin{bmatrix} —(\theta^{(1)}) ^T— \cr —(\theta^{(2)}) ^T— \cr \vdots \cr —(\theta^{(n_u)}) ^T— \end{bmatrix}$
Predict ratings by computing the following matrix (low rank matrix factorization):
$$
\begin{bmatrix}
(\theta^{(1)}) ^T (x^{(1)}) & (\theta^{(2)}) ^T (x^{(1)}) & \cdots & (\theta^{(n_u)}) ^T (x^{(1)}) \cr
(\theta^{(1)}) ^T (x^{(2)}) & (\theta^{(2)}) ^T (x^{(2)}) & \cdots & (\theta^{(n_u)}) ^T (x^{(2)}) \cr
\vdots & \vdots & \vdots & \vdots \cr
(\theta^{(1)}) ^T (x^{(n_m)}) & (\theta^{(2)}) ^T (x^{(n_m)}) & \cdots & (\theta^{(n_u)}) ^T (x^{(n_m)})
\end{bmatrix} = \Theta^T X
$$
Using the learned features to find related movies:
For each product $i$, we learn a feature vector $x^{(i)} \in \mathbb R^n$
Implementational Detail: Mean Normalization
In addition to the previous 4 users, add a new user, Eve, who has not rated any movies.
Therefore, the ratings matrix $Y = \begin{bmatrix} 5 & 5 & 0 & 0 & ? \cr 5 & ? & ? & 0 & ? \cr ? & 4 & 0 & ? & ? \cr 0 & 0 & 5 & 4 & ? \cr 0 & 0 & 5 & ? & ? \end{bmatrix}$
Let $n = 2$, so we are going to learn parameter vector $\theta^{(5)} \in \mathbb R^2$ with 2 features. The user Eve has not rated any movies, so the first term of the objective function, $\sum_{(i, j): r(i, j) = 1} \left( (\theta^{(j)}) ^T x^{(i)} - y^{(i, j)} \right)^2$,is equal to 0, playing no role in determing $\theta^{(5)}$. Besides, there are no movies that Eve has rated, so the second term is equal to 0, too.
Hence, only the last term has an effect on the $\theta^{(5)}$.
$\frac {\lambda}2 \sum_{j=1}^{n_u} \sum_{k=1}^n = \frac {\lambda}2 [(\theta_1^{(5)}) ^2 + (\theta_2^{(5)}) ^2]$
However, the right term will end up with a vector of all zeros for minimizing, which means $\theta^{(5)} = \begin{bmatrix} 0 \cr 0 \end{bmatrix}$. Apparently, it is not useful to predict that Eve rates every movie 0 stars.
Mean Normalization:
Compute the average rating of each movie:
$\mu = \begin{bmatrix} (5+5+0+0) / 4 \cr (5+0) / 2 \cr (4+0) / 2 \cr (0+0+5+4) / 4 \cr (0+0+5+0) / 4 \end{bmatrix} = \begin{bmatrix} 2.5 \cr 2.5 \cr 2 \cr 2.25 \cr 1.25 \end{bmatrix} $
Normalize each movie to have an average rating of 0:
$\bar{Y} = Y - \mu = \begin{bmatrix} 2.5 & 2.5 & -2.5 & -2.5 & ? \cr 2.5 & ? & ? & -2.5 & ? \cr ? & 2 & -2 & ? & ? \cr -2.25 & -2.25 & 2.75 & 1.75 & ? \cr -1.25 & -1.25 & 3.75 & -1.25 & ? \end{bmatrix}$
Use the collaborative filtering algorithm to learn parameter $\theta^{j}$ and $x^{(i)}$ from the mean normalized movie ratings above.
For user $j$ on movie $i$, predict:
$\qquad (\theta^{(j)}) ^T (x^{(i)}) + \mu_i$
User 5 (Eve):
$\because \theta^{(5)} = \begin{bmatrix} 0 \cr 0 \end{bmatrix}$
$\therefore (\theta^{(5)}) ^T (x^{(i)}) = 0$
To sum up, if there is a movie without any rating, then this movie should not be recommended to anyone, so the case of a user who has not rated any movies might be more important than the case of a movie that has not gotten a single rating.
Notes: Machine Learning by Andrew Ng (Week8)
Clustering
Unsupervised Learning: Introduction
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
Apllications of clustering:
- Market segmentation
- Social network analysis
- Organize computing clusters
- Astronomical data analysis
K-Means (An Iterative Algorithm)
For example, set K = 2 which means we want to group the data into two clusters.
($\mathbf {1}$) Cluster Assignment Step:
Firstly, randomly initialize two points, called the cluster centroids.
Secondly, go through your data set and color each of the points either red or blue, depending on whether it is closer to the red cluster centroid or the blue cluster centroid.
($\mathbf {2}$) Move Centroid Step:
Take the two cluster centroids, the red cross and the blue cross, move them to the average of the points colored the same color. So what we are going to do is look at all the red points and compute the average (the mean of the location of all the red points), and move the red cluster centroid there. Then, do the same things for the blue cluster centroid.
($\mathbf {3}$) Iterate (1) ~ (2) until K-Means has converged:
The cluster centroids will not change any further and the colors of the points will not change any further.
k-means algorithm:
Input:
- $K$ (number of clusters)
- Training set {$x^{(1)}$, $x^{(2)}$,…, $x^{(m)}$}, $x^{(i)} \in \mathbb R^n$ (drop $x_0 = 1$)
Randomly initialize $K$ cluster centroids $\mu_1$,$\mu_2$,…,$\mu_K \in \mathbb R^n$
Repeat {
// Cluster Assignment Step:
$for\ i = 1: m$
$c^{(i)} = \underset{k}{min} |x^{(i)} - \mu_k|^2$ // index from 1 to $K$ of cluster centroid closest to $x^{(i)}$
// Move Centroid Step:
$for\ k = 1: K$
$\mu^{(i)} = avg(\sum x^j)$ // average (mean) of points assigned to cluster k
}
Let $\mu_k$ be the average of the points assigned to a certain cluster. However, there is a cluster centroid with zero points assigned to it. In that case the more common thing to do is to just eliminate that cluster centroid, ending up with $K-1$ clusters instead of $K$ clusters. Sometimes if it is necessary to have $K$ clusters, just randomly reinitialize that cluster centroid.
Optimization Objective
$\mu_{c^{i}}$ = cluster centroid of cluster to which example $x^{(i)}$ has been assigned
k-means optimization objective: try to minimize the cost function $J$ (the distortion function)
$$J(c^{(1)}, …, c^{(m)}, \mu^{(1)}, …, \mu^{(K)}) = \frac 1m \sum_{i=1}^m | x^{(i)} - \mu_{c^{(i)}} |^2$$
$$\underset{c^{(1)}, …, c^{(m)}, \mu^{(1)}, …, \mu^{(K)}}{min} J(c^{(1)}, …, c^{(m)}, \mu^{(1)}, …, \mu^{(K)})$$
- Cluster assignment step:
minimize $J$ with respect to the variables $c_1$,$c_2$,…,$c_m$ while holding the cluster centroid $\mu_1$,…,$\mu_K$ fixed. - Move centroid step:
minimize $J$ with respect to the locations of the cluster centroids $\mu_1$,…,$\mu_K$
The cost function $J$ can also be used to debug other algorithm and make sure the implementation of k-means is running correctly.
Random Initialization
- Should have $K$ < m.
- Randomly pick $K$ training examples.
- Set $\mu_1$,…,$\mu_K$ equal to these $K$ examples.
k-means could end up converging to different solutions depending on how the clusters are initialized, in another world, depending on the random initialization. In particular, k-means could actually end up at the local optima:
Try multiple random initialization to increase the odds of k-means finding the best possible clustering.
for i = 1 : n (n $\in$ [50, 1000])
Randomly initialize k-means;
Run k-means and get $c^{(1)}, …, c^{(m)}, \mu^{(1)}, …, \mu^{(K)}$;
Compute cost function $J(c^{(1)}, …, c^{(m)}, \mu^{(1)}, …, \mu^{(K)})$;
}
Finally, pick clustering that gives lowest cost $J(c^{(1)}, …, c^{(m)}, \mu^{(1)}, …, \mu^{(K)})$
Attention:
- If $K \in$ [2, 10], then doing multiple random initialization can often make sure of finding a better local optima.
- If $K \gg$ 10, having multiple random initialization is less likely to make a great difference, and there is a much higher chance that the first random initialization will have already gave a pretty decent solution.
Choosing the Number of Clusters
Elbow Method: run k-means with 1, 2,…, $K$ clusters, and compute the distortion $J$ and plot those values. It is not used often because sometime it will end up with a curve that looks much more ambiguous.
The most common way of choosing the number of clusters is by choosing it manually by human input or human insight. Or, a better way to determine the number of clusters is to see how well different numbers of clusters serve the later downstream purpose.
Dimensionality Reduction
Motivation I: Data Compression
Dimensionality reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. Data compession makes data use up less computer memory or disk space, in addition to speeding up the learning algorithm.
Motivation II: Visualization
Use dimensionality reduction to reduce data from n dimensions (n > 3) down to 2 or 3 dimensions, so that the new data can be plotted and be understood better.
Principal Component Analysis
Principal Component Analysis Problem Formulation
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.
For example, in the following, we want to reduce data from 2-dimensional to 1-dimensional, in another word, the data will be projected onto the red line or the magenta line. The projection error (the blue line segments) will be huge if the data is projected onto the magenta line. That’s why PCA will choose the red line rather than the magenta line here.
The goal of PCA:
- 2D $\rightarrow$ 1D: to find a vector $u^{(1)} \in \mathbb R^n$ onto which to project the data so as to minimize the projection error.
- 3D $\rightarrow$ 2D: to find a pair of vector, $u^{(1)}$ and $u^{(2)}$, these two vectors define a plane.
- nD $\rightarrow$ kD: to find $k$ vectors $u^{(1)}$, $u^{(2)}$, …, $u^{(k)}$ onto which to project the data, so as to minimize the projection error, where we can define the position of the points in a plane using k directions (vectors).
In short, PCA tries to find a lower dimensional surface onto which to project the data, so as to minimize the square projection error (orthoginal projection error).
PCA VS linear regression:
Principal Component Analysis Algorithm
($\mathbf {1}$) Data preprocessing:
Training set: $x^{(1)}$, $x^{(2)}$, …, $x^{(m)}$
Preprocessing (feature scaling / mean normalization)
- $\mu_j = \frac 1m \sum_{i=1}^m x_j^{(i)}$
- Replace each $x_j^{(i)}$ with $x_j - \mu_j$.
- If different features on different scales (e,g., $x_1$ = size of house, $x_2$ = number of bedrooms), scale features to have comparable range of values: $x_j^{(i)} = \frac {x_j - \mu_j}{s_j}$, $s_j$. could be $max -min$ or standard deviation of feature $j$.
($\mathbf {2}$) PCA algorithm:
Reduce data from $n$-dimensions to $k$-dimensions:
Compute “covariance matrix”:
$$\Sigma(Sigma) = \frac 1m \sum_{i=1}^n (x^{(i)}) (x^{(i)}) ^T $$
Compute “eigenvectors” of matrix $\Sigma$:
$\qquad \color{darkblue} {[U, S, V] = svd(Sigma)}$;
And we get:
$$
U = \begin{bmatrix}
| & | & & | \\
u^{(1)} & u^{(2)} & \cdots & u^{(n)} \\
| & | & & |
\end{bmatrix} \in \mathbb R^{n \times n}
$$
Use the symbol $U_{reduce}$ to represent the matrix including the preceding $k$ columns of the matrix $U$:
$$
U_{reduce} = \begin{bmatrix}
| & | & & | \\
u^{(1)} & u^{(2)} & \cdots & u^{(k)} \\
| & | & & |
\end{bmatrix} \in \mathbb R^{n \times k}
$$
$$
z^{(i)} = (U_{reduce})^T x ^{(i)} = \begin{bmatrix}
— (u^{(1)}) ^T— \\
\vdots \\
— (u^{(k)}) ^T—
\end{bmatrix}x^{(i)}
$$
$\because (U_{reduce})^T \in \mathbb R^{k \times n}$, $x^{(i)} \in \mathbb R^n$
$\therefore z^{(i)} \in \mathbb R^k$
By the way, if the data is represented as a matrix like $X = \begin{bmatrix} — (x^{(1)}) ^T— \\ \vdots \\ — (x^{(m)}) ^T— \end{bmatrix}$, then $Sigma = \frac 1m X’ X$.
Applying PCA
Reconstruction from Compressed Representation
Reconstruction: take the low representation $z$ and map it back up to an approximation of the original high-dimensional data $x$, by $x \approx x_{approx} = U_{reduce} \cdot z$:
Choosing the Number of Principal Components
The PCA algorithm takes $n$-dimensional features and reduces them to some $k$-dimensional features representation. This number $k$, also called the number of principle components, is a parameter of the PCA algorithm.
Average squared projection error: $\frac 1m \sum_{i=1}^m | x^{(i)} - x_{approx}^{(i)} |^2$
Total variation in the data: $\frac 1m \sum_{i=1}^m | x^{(i)} |^2$
Typically, choose $k$ to be the smallest value so that
$$\frac {\frac 1m \sum_{i=1}^m | x^{(i)} - x_{approx}^{(i)} |^2} {\frac 1m \sum_{i=1}^m | x^{(i)} |^2} \le 0.01$$
which means “99% of variance is retained”. If the fraction $\le 0.05$, which means “95% of variance is retained”.
The algorithm of choosing $k$:
-
Method ($\mathbf 1$): inefficient!
for k = 1: n-1 (maybe)
Try PCA with k;
Compute $U_{reduce}$, $z^{(1)}$, $z^{(2)}$,…, $z^{(m)}$, $x_{approx}^{(1)}$,…, $x_{approx}^{(m)}$
Check if$\frac {\frac 1m \sum_{i=1}^m | x^{(i)} - x_{approx}^{(i)} |^2} {\frac 1m \sum_{i=1}^m | x^{(i)} |^2} \le 0.01$? break : continue; -
Method ($\mathbf 2$): much more efficient
$\color{darkblue}{[U, S, V] = svd(Sigma)}$, $S \in \mathbb R^{n \times n}$ is a diagonal matrix.
$$ S = \begin{bmatrix}
S_{11} & & & \\
& S_{22} & & \\
& & \ddots & \\
& & & S_{nn}
\end{bmatrix} $$
Pick the smallest value of $k$ for which $\frac {\sum_{i=1}^k S_{ii}}{\sum_{i=1}^m S_{ii}} \ge 0.99$ (99% of variance retained).
Advice for Applying PCA
Speed up supervised learning:
($x^{(1)}$, $y^{(1)}$), ($x^{(2)}$, $y^{(2)}$), …, ($x^{(3)}$, $y^{(3)}$)
Extract inputs:
$x^{(1)}$, $x^{(2)}$,…, $x^{(m)} \in \mathbb R^{10000}$ (unlabeled dataset)
Apply PCA: $\downarrow$
$z^{(1)}$, $z^{(2)}$,…, $z^{(m)} \in \mathbb R^{1000}$
New training set:
($z^{(1)}$, $y^{(1)}$), ($z^{(2)}$, $y^{(2)}$), …, ($z^{(3)}$, $y^{(3)}$)
Finally, use the reduced dimension training set $Z$ and feed it to a learning algorithm (a neural network or logistic regression). The input can be used to learn the hypothesis and make predictions.
Note: Mapping $x^{(i)} \rightarrow z^{(i)}$ should be defined by running PCA only on the training set. This mapping can be applied as well to the example $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross validation and test sets.
Application of PCA:
- Compression:
($\mathbf 1$) Reduce memory / disk needed to store data
($\mathbf 2$) Speed up learning algorithm - Visualization
Bad use of PCA: To prevent overfitting:
Use $z{(i)}$ instead of $x{(i)}$ to reduce the number of features to $k < n$, Thus, fewer features, less likely to overfit. This might work OK, but it is not a good way to address overfitting. Use regularization instead.
In the end, before implement PCA, first try to do with the original / raw data $x{(i)}$. Only if that does not meet the demand or expectation, then implement PCA and consider to use $z{(i)}$.
Notes: Machine Learning by Andrew Ng (Week7)
Large Margin Classification
Optimization Objective
Alternative view of logistic regression: $h_{\theta}(x) = \frac 1{1 + e^{-\theta ^T x}}$
Cost function: $$\mathrm{} \left( y\log h_{\theta}(x) + (1 - y) \log \left(1 - h_{\theta}(x) \right) \right) = -y \log \frac 1{1 + e^{-\theta ^T x}} - (1 - y) \log (1 - \frac 1{1 + e^{-\theta ^T x}})$$
If $\ y = 1$, we want $\ \theta^T x \gg 0$
If $\ y = 0$, we want $\ \theta^T x \ll 0$
Modify the original cross function a little bit. The new curve of the cost function when y = 1 is made up of two line segments, there is the flat portion on the right and the straight line portion on the left. The slope of the straight line does not matter that much:
The new cost function will do something pretty similar to logistic regression and give the support vector machine (SVM) computational advantages.
The other case if y = 0:
Logistic regression:
$$\underset{\theta}{min} \left[ \frac 1m \sum_{i=1}^m y^{(i)} (-\log h_{\theta} (x^{(i)}) + (1 - y^{(i)}) \left(-\log \left( 1 - h_{\theta} (x^{(i)}) \right) \right) + \frac {\lambda}{2m} \sum_{j=1}^n \theta_j ^2 \right]$$
For the SVM, first, we take the log functions and replace them with $cost_0(z)$ and $cost_1(z)$. Second, we get rid of the term $\frac 1m$ because it does not change the value of $\theta$. Third, we use the parameter $C = \frac 1{\lambda}$ instead of the regularization term in the logistic regression. The parameter $C$ is a positive value that controls the penalty for misclassified training examples. A large parameter $C$ tells the SVM to try to classify all the examples correctly.
The optimization objective function for the SVM:
$$\underset{\theta}{min} \ C \sum_{i=1}^m \left[ y^{(i)} cost_1(\theta^T x^{(i)}) + (1 - y^{(i)}) cost_0(\theta^T x^{(i)}) \right] + \frac 12 \sum^n_{i=1} \theta_j^2$$
Large Margin Intuition
If $\ y = 1$, we want $\ \theta^T x \ge 1$ (not just $\ge 0$)
If $\ y = 0$, we want $\ \theta^T x \le -1$ (not just < 0)
SVM Decision Boundary:
Assume $C$ is a very large value, then when minimizing the optimization objective, the following term must infinitely approach to zero:
$$\sum_{i=1}^m \left[ y^{(i)} cost_1(\theta^T x^{(i)}) + (1 - y^{(i)}) cost_0(\theta^T x^{(i)}) \right] \approx 0$$.
Which means:
- Whenever $y^{(i)} = 1$:
$\because (1 - y^{(i)}) cost_0 (\theta^T x^{(i)}) = 0$, in order to make the whole term zero, $cost_1 (\theta^T x^{(i)})$ should be equal to zero,
$\therefore \theta^T x^{(i)} \ge 1$ - Whenever $y^{(i)} = 0$:
$\because y^{(i)} cost_1 (\theta^T x^{(i)}) = 0$, in order to make the whole term zero, $cost_0 (\theta^T x^{(i)})$ should be equal to zero,
$\therefore \theta^T x^{(i)} \le -1$
Therefore, the optimization objective becomes:
$$\require{enclose} \underset{\theta}{min} (\enclose{horizontalstrike}{C \times 0} + \frac 12 \sum_{i=1}^n \theta_j^2)$$
$\begin{align*} \text{s.t} \quad &\theta^T x^{(i)} \ge 1 \ &\text{if} \ \ y^{(i)} = 1 \newline &\theta^T x^{(i)} \le -1 \ &\text{if} \ \ y^{(i)} = 0 \end{align*}$
We will get a decision boundary when minimizing this as a funtion of the parameter $\theta$.
SVM Decision Boundary: Linearly separable case
The Support Vector Machine will choose the black line as decision boundary. Mathematically, this black decision boundary has a larger distance.
That distance is called the margin. The black decision boudnary has some larger minimum distance from any of the training examples, which means it tries to separate the data with as a large a margin as possible.
SVM is also called a large margin classifier.
Large margin classifier in presence of outliers
If the regularization parameter $C$ is very large, $\ \because C = \frac 1{\lambda}$, $\ \therefore \lambda$ is small, then it will change the decision boundary from the black to the magenta one. However, if $C$ is small, i.e. $\lambda$ is large, it ill end up with the original black decision boundary.
If the data is not linearly separable, the SVM is still able to do the right thing and to ignore the few outliers when $C$ is small.
Mathematics Behind Large Margin Classification
Kernels
Kernel I
Kernel II
SVMs in Practice
Using An SVM
Notes: Machine Learning by Andrew Ng (Week1)
Introduction
What Is Machine Learning?
- The field of study that gives computers the ability to learn without being explicitly programmed.
- A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Example: playing checkers.
E = the experience of playing many games of checkers;
T = the task of playing checkers;
P = the probability that the program will win the next game;
Supervised Learning
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into “regression” and “classification” problems:
- Regression: predict results within a continuous output, meaning that map input variables to some continuous function.
- Classification: predict results in a discrete output. In other words, map input variables into discrete categories.
Unsupervised Learning
Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.
Example:
-
Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.
-
Non-clustering: The “Cocktail Party Algorithm”, allows you to find structure in a chaotic environment. (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).
Model and Cost Function
Model Representation
Note that the superscript “(i)” in the notation is simply an index into the training set.
$x^{(i)}$: input (input variables or input features)
$y^{(i)}$: output (target variable)
$(x^{(i)}, y^{(i)})$: a training example;
X: the space of input values;
Y: the space of output values;
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function (hypothesis) $h : X \rightarrow Y$ so that h(x) is a “good” predictor for the corresponding value of y. Seen pictorially, the process is therefore like this:
Cost Function - To Measure The Accuracy of The Hypothesis Function
Squared error function (Mean squared error):
$$J(\theta_0, \theta_1) = \frac 1{2m}\sum_{i=1}^m (h_\theta(x_i) - y_i)^2$$
Cost Function - Intuition I
If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by $h_\theta(x)$) which passes through these scattered data point.
Our objective is to get the best possible line. The best possible line will be such so that the average squared vertical distances of the scattered points from the line will be the least. Ideally, the line should pass through all the points of our training data set. In such a case, the value of $J(\theta_0, \theta_1)$ will be 0.
Cost Function - Intuition II
A contour line of a two variable function has a constant value at all points of the same line. Giving our hypothesis function a slightly positive slope results in a better fit of the data. The graph in the following minimizes the cost function as much as possible and consequently, the result of $\theta_0$ and $\theta_1$ tend to be around 0.12 and 250 respectively. Plotting those values on our graph to the right seems to put our point in the center of the inner most ‘circle’.
Parameter Learning
Gradient Descent
Now we need to estimate the parameters in the hypothesis function.
Imagine that we graph our hypothesis function based on its fields $\theta_0$ and $\theta_1$ (actually we are graphing the cost function as a function of the parameter estimates). We are graphing the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters:
We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum points in the graph.
The way we do this is by taking the derivative of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the learning rate $\alpha$.
For example, the distance between each ‘star’ in the graph above represents a step determined by our parameter $\alpha$. A smaller $\alpha$ would result in a smaller step and a larger $\alpha$ results in a larger step. The direction in which the step is taken is determined by the partial derivative of $J(\theta_0, \theta_1)$. Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.
The gradient descent algorithm is (repeat until convergence):
$$\theta_j = \theta_j - \alpha\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$
where j = 0, 1 represents the feature index number.
At each iteration j, one should simultaneously update the parameters $\theta_1, \theta_2, …, \theta_n$. Updating a specific parameter prior to calculating another one on the $j^{th}$ iteration would yield to a wrong implementation.
$temp0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
$temp1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
$\theta_0 := temp0$
$\theta_1 := temp1$
Gradient Descent Intuition
We should adjust our parameter $\alpha$ to ensure that the gradient descent algorithm converges in a reasonable time.
- If $\alpha$ is too small, gradient descent can be slow.
- If $\alpha$ is too large, gradient descent can overshoot the minimum. It may fail to converge, or event diverge.
Gradient descent can converge to a local minimum, even with the learning rate $\alpha$ fixed. As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease $\alpha$ over time. The intuition behind the convergence is that $\frac d{d \theta1} J(\theta_1)$ approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get: $\theta_1$ := $\theta_1 - \alpha \cdot 0$.
Gradient Descent for Linear Regression
When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to:
repeat until convergence: {
$\qquad\theta_0 := \theta_0 - \alpha \frac 1m \sum_{i=1}^m(h_{\theta}(x_i) - y_i)$
$\qquad\theta_1 := \theta_1 - \alpha \frac 1m \sum_{i=1}^m((h_{\theta}(x_i) - y_i) \cdot x_i)$
}
$\begin{aligned} \frac {\partial}{\partial\theta_j}J(\theta) &= \frac {\partial}{\partial \theta_j} [\frac 12 (h_{\theta}(x) - y)^2]\\ &= 2 \cdot \frac 12 (h_{\theta}(x) - y) \cdot \frac {\partial}{\partial \theta_j} (h_{\theta}(x) - y)\\ &= (h_{\theta}(x) - y) \cdot \frac {\partial}{\partial\theta_j} (\sum^n_{i=0}\theta_i x_i - y) \\ &= (h_{\theta}(x) - y) \cdot x_j \quad(j \neq 0) \end{aligned}$
When $j = 0$,
$\because h_{\theta}(x) = \theta X = \theta_0 x_0 + \theta_1 x_1 = \theta_0 + \theta_1 x_1 \quad (x_0 = 1)$
$\therefore \frac {\partial}{\partial\theta_0}(h_{\theta}(x) - y) = 1$
$\therefore \frac {\partial}{\partial\theta_0} J(\theta) = (h_{\theta}(x) - y)$
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
Thus, this is simply gradient descent on the original cost function $J$. This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local optimal.